{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Tce3stUlHN0L"
      },
      "source": [
        "##### Copyright 2020 The TensorFlow Authors."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "tuOe1ymfHZPu"
      },
      "outputs": [],
      "source": [
        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qFdPvlXBOdUN"
      },
      "source": [
        "# Tokenizing with TF Text"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MfBg1C5NB3X0"
      },
      "source": [
        "\u003ctable class=\"tfo-notebook-buttons\" align=\"left\"\u003e\n",
        "  \u003ctd\u003e\n",
        "    \u003ca target=\"_blank\" href=\"https://www.tensorflow.org/text/guide/tokenizers\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" /\u003eView on TensorFlow.org\u003c/a\u003e\n",
        "  \u003c/td\u003e\n",
        "  \u003ctd\u003e\n",
        "    \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/text/blob/master/docs/guide/tokenizers.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" /\u003eRun in Google Colab\u003c/a\u003e\n",
        "  \u003c/td\u003e\n",
        "  \u003ctd\u003e\n",
        "    \u003ca target=\"_blank\" href=\"https://github.com/tensorflow/text/blob/master/docs/guide/tokenizers.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" /\u003eView on GitHub\u003c/a\u003e\n",
        "  \u003c/td\u003e\n",
        "  \u003ctd\u003e\n",
        "    \u003ca href=\"https://storage.googleapis.com/tensorflow_docs/text/docs/guide/tokenizers.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/download_logo_32px.png\" /\u003eDownload notebook\u003c/a\u003e\n",
        "  \u003c/td\u003e\n",
        "  \u003ctd\u003e\n",
        "    \u003ca href=\"https://tfhub.dev/google/zh_segmentation/1\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/hub_logo_32px.png\" /\u003eSee TF Hub models\u003c/a\u003e\n",
        "  \u003c/td\u003e\n",
        "\u003c/table\u003e"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xHxb-dlhMIzW"
      },
      "source": [
        "## Overview\n",
        "\n",
        "Tokenization is the process of breaking up a string into tokens. Commonly, these tokens are words, numbers, and/or punctuation. The `tensorflow_text` package provides a number of tokenizers available for preprocessing text required by your text-based models. By performing the tokenization in the TensorFlow graph, you will not need to worry about differences between the training and inference workflows and managing preprocessing scripts.\n",
        "\n",
        "This guide discusses the many tokenization options provided by TensorFlow Text, when you might want to use one option over another, and how these tokenizers are called from within your model."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MUXex9ctTuDB"
      },
      "source": [
        "## Setup"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "z0oj4HS26x05"
      },
      "outputs": [],
      "source": [
        "!pip install -q tensorflow-text"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "alf2kDHJ60rO"
      },
      "outputs": [],
      "source": [
        "import requests\n",
        "import tensorflow as tf\n",
        "import tensorflow_text as tf_text"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "i4rfKxVvBBu0"
      },
      "source": [
        "## Splitter API\n",
        "\n",
        "The main interfaces are `Splitter` and `SplitterWithOffsets` which have single methods `split` and `split_with_offsets`. The `SplitterWithOffsets` variant (which extends `Splitter`) includes an option for getting byte offsets. This allows the caller to know which bytes in the original string the created token was created from.\n",
        "\n",
        "The `Tokenizer` and `TokenizerWithOffsets` are specialized versions of the `Splitter` that provide the convenience methods `tokenize` and `tokenize_with_offsets` respectively.\n",
        "\n",
        "Generally, for any N-dimensional input, the returned tokens are in a N+1-dimensional [RaggedTensor](https://www.tensorflow.org/guide/ragged_tensor) with the inner-most dimension of tokens mapping to the original individual strings.\n",
        "\n",
        "```python\n",
        "class Splitter {\n",
        "  @abstractmethod\n",
        "  def split(self, input)\n",
        "}\n",
        "\n",
        "class SplitterWithOffsets(Splitter) {\n",
        "  @abstractmethod\n",
        "  def split_with_offsets(self, input)\n",
        "}\n",
        "```\n",
        "\n",
        "There is also a `Detokenizer` interface. Any tokenizer implementing this interface can accept a N-dimensional ragged tensor of tokens, and normally returns a N-1-dimensional tensor or ragged tensor that has the given tokens assembled together.\n",
        "\n",
        "```python\n",
        "class Detokenizer {\n",
        "  @abstractmethod\n",
        "  def detokenize(self, input)\n",
        "}\n",
        "```"
      ]
    },
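    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a rough illustration of the dimension rule above, the following sketch uses plain Python lists as a stand-in for a `RaggedTensor`. It does not call TF Text, and `str.split` is only an approximation of a tokenizer. A 1-dimensional batch of strings becomes a 2-dimensional nested list whose rows have different lengths."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Plain-Python stand-in for a RaggedTensor (assumption: str.split approximates a tokenizer).\n",
        "batch = [\"Never tell me the odds.\", \"It's a trap!\"]\n",
        "nested = [sentence.split() for sentence in batch]\n",
        "print(nested)\n",
        "# Each row keeps its own length, which is why the result is ragged.\n",
        "print([len(row) for row in nested])"
      ]
    },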
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BhviJXy0BDoa"
      },
      "source": [
        "## Tokenizers\n",
        "\n",
        "Below is the suite of tokenizers provided by TensorFlow Text. String inputs are assumed to be UTF-8. Please review the [Unicode guide](https://www.tensorflow.org/text/guide/unicode) for converting strings to UTF-8."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eWFisXk-68BQ"
      },
      "source": [
        "### Whole word tokenizers\n",
        "\n",
        "These tokenizers attempt to split a string by words, and is the most intuitive way to split text.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-CxjAs5wOYKh"
      },
      "source": [
        "#### WhitespaceTokenizer\n",
        "\n",
        "The `text.WhitespaceTokenizer` is the most basic tokenizer which splits strings on ICU defined whitespace characters (eg. space, tab, new line). This is often good for quickly building out prototype models."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "k4a11Hlm7C4k"
      },
      "outputs": [],
      "source": [
        "tokenizer = tf_text.WhitespaceTokenizer()\n",
        "tokens = tokenizer.tokenize([\"What you know you can't explain, but you feel it.\"])\n",
        "print(tokens.to_list())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "VHS6dEQ7cR9E"
      },
      "source": [
        "You may notice a shortcome of this tokenizer is that punctuation is included with the word to make up a token. To split the words and punctuation into separate tokens, the `UnicodeScriptTokenizer` should be used."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-xohhm0Q7AmN"
      },
      "source": [
        "#### UnicodeScriptTokenizer\n",
        "\n",
        "The `UnicodeScriptTokenizer` splits strings based on Unicode script boundaries. The script codes used correspond to International Components for Unicode (ICU) UScriptCode values. See: http://icu-project.org/apiref/icu4c/uscript_8h.html\n",
        "\n",
        "In practice, this is similar to the `WhitespaceTokenizer` with the most apparent difference being that it will split punctuation (USCRIPT_COMMON) from language texts (eg. USCRIPT_LATIN, USCRIPT_CYRILLIC, etc) while also separating language texts from each other. Note that this will also split contraction words into separate tokens."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "68u0XF3L6-ay"
      },
      "outputs": [],
      "source": [
        "tokenizer = tf_text.UnicodeScriptTokenizer()\n",
        "tokens = tokenizer.tokenize([\"What you know you can't explain, but you feel it.\"])\n",
        "print(tokens.to_list())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "J0Ja_h1qO7P0"
      },
      "source": [
        "### Subword tokenizers\n",
        "\n",
        "Subword tokenizers can be used with a smaller vocabulary, and allow the model to have some information about novel words from the subwords that make create it.\n",
        "\n",
        "We briefly discuss the Subword tokenization options below, but the [Subword Tokenization tutorial](https://www.tensorflow.org/text/guide/subwords_tokenizer) goes more in depth and also explains how to generate the vocab files."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BLif2owYPBos"
      },
      "source": [
        "#### WordpieceTokenizer\n",
        "\n",
        "WordPiece tokenization is a data-driven tokenization scheme which generates a set of sub-tokens. These sub tokens may correspond to linguistic morphemes, but this is often not the case.\n",
        "\n",
        "The WordpieceTokenizer expects the input to already be split into tokens. Because of this prerequisite, you will often want to split using the `WhitespaceTokenizer` or `UnicodeScriptTokenizer` beforehand."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "srIHtzU2fxCi"
      },
      "outputs": [],
      "source": [
        "tokenizer = tf_text.WhitespaceTokenizer()\n",
        "tokens = tokenizer.tokenize([\"What you know you can't explain, but you feel it.\"])\n",
        "print(tokens.to_list())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "uUZe66RngCGU"
      },
      "source": [
        "After the string is split into tokens, the `WordpieceTokenizer` can be used to split into subtokens."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ISEUjIsYAl2S"
      },
      "outputs": [],
      "source": [
        "url = \"https://github.com/tensorflow/text/blob/master/tensorflow_text/python/ops/test_data/test_wp_en_vocab.txt?raw=true\"\n",
        "r = requests.get(url)\n",
        "filepath = \"vocab.txt\"\n",
        "open(filepath, 'wb').write(r.content)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "uU8wJlUfsskU"
      },
      "outputs": [],
      "source": [
        "subtokenizer = tf_text.UnicodeScriptTokenizer(filepath)\n",
        "subtokens = tokenizer.tokenize(tokens)\n",
        "print(subtokens.to_list())"
      ]
    },
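    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To give some intuition for how a word is broken into subtokens, here is a minimal pure-Python sketch of greedy longest-match-first matching, the strategy commonly described for WordPiece. It is not the TF Text implementation. The `wordpiece_sketch` helper, the `toy_vocab` set, the `##` continuation prefix, and the `[UNK]` token are assumptions made for illustration."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Hypothetical helper: greedy longest-match-first subword splitting.\n",
        "def wordpiece_sketch(word, vocab, unk=\"[UNK]\", prefix=\"##\"):\n",
        "  pieces, start = [], 0\n",
        "  while start != len(word):\n",
        "    # Try the longest remaining substring first, then shrink it.\n",
        "    for end in range(len(word), start, -1):\n",
        "      candidate = word[start:end]\n",
        "      if start:  # non-initial pieces carry the continuation prefix\n",
        "        candidate = prefix + candidate\n",
        "      if candidate in vocab:\n",
        "        pieces.append(candidate)\n",
        "        start = end\n",
        "        break\n",
        "    else:\n",
        "      # No vocabulary entry matched, so the whole word is unknown.\n",
        "      return [unk]\n",
        "  return pieces\n",
        "\n",
        "# Toy vocabulary (an assumption; not the vocab file downloaded above).\n",
        "toy_vocab = {\"feel\", \"##ing\", \"un\", \"##know\", \"##n\"}\n",
        "print(wordpiece_sketch(\"feeling\", toy_vocab))\n",
        "print(wordpiece_sketch(\"unknown\", toy_vocab))\n",
        "print(wordpiece_sketch(\"xyz\", toy_vocab))"
      ]
    },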
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ncBcigHAPFBd"
      },
      "source": [
        "#### BertTokenizer\n",
        "\n",
        "The BertTokenizer mirrors the original implementation of tokenization from the BERT paper. This is backed by the WordpieceTokenizer, but also performs additional tasks such as normalization and tokenizing to words first."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "2tOz1hNhtdV2"
      },
      "outputs": [],
      "source": [
        "tokenizer = tf_text.BertTokenizer(filepath, token_out_type=tf.string, lower_case=True)\n",
        "tokens = tokenizer.tokenize([\"What you know you can't explain, but you feel it.\"])\n",
        "print(tokens.to_list())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-rb_dORMO-3t"
      },
      "source": [
        "#### SentencepieceTokenizer\n",
        "\n",
        "The SentencepieceTokenizer is a sub-token tokenizer that is highly configurable. This is backed by the Sentencepiece library. Like the BertTokenizer, it can include normalization and token splitting before splitting into sub-tokens.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "0dUbFCfDCojr"
      },
      "outputs": [],
      "source": [
        "url = \"https://github.com/tensorflow/text/blob/master/tensorflow_text/python/ops/test_data/test_oss_model.model?raw=true\"\n",
        "sp_model = requests.get(url).content"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "uvsm6iuNupEZ"
      },
      "outputs": [],
      "source": [
        "tokenizer = tf_text.SentencepieceTokenizer(sp_model, out_type=tf.string)\n",
        "tokens = tokenizer.tokenize([\"What you know you can't explain, but you feel it.\"])\n",
        "print(tokens.to_list())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1TatehW0Q0qV"
      },
      "source": [
        "### Other splitters\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wqNgtoFPQ1sG"
      },
      "source": [
        "#### UnicodeCharTokenizer\n",
        "\n",
        "This splits a string into UTF-8 characters. It is useful for CJK languages that do not have spaces between words."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "4GjiAnQFvIhW"
      },
      "outputs": [],
      "source": [
        "tokenizer = tf_text.UnicodeCharTokenizer()\n",
        "tokens = tokenizer.tokenize([\"What you know you can't explain, but you feel it.\"])\n",
        "print(tokens.to_list())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XHyWQcJZGOwL"
      },
      "source": [
        "The output is Unicode codepoints. This can be also useful for creating character ngrams, such as bigrams. To convert back into UTF-8 characters."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "_uuyz3XC0NdU"
      },
      "outputs": [],
      "source": [
        "characters = tf.strings.unicode_encode(tf.expand_dims(tokens, -1), \"UTF-8\")\n",
        "bigrams = tf_text.ngrams(characters, 2, reduction_type=tf_text.Reduction.STRING_JOIN, string_separator='')\n",
        "print(bigrams.to_list())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "oCmTbCnkQ4At"
      },
      "source": [
        "#### HubModuleTokenizer\n",
        "\n",
        "This is a wrapper around models deployed to TF Hub to make the calls easier since TF Hub currently does not support ragged tensors. Having a model perform tokenization is particularly useful for CJK languages when you want to split into words, but do not have spaces to provide a heuristic guide. At this time, we have a single segmentation model for Chinese."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "R8rWv3rAv_cb"
      },
      "outputs": [],
      "source": [
        "MODEL_HANDLE = \"https://tfhub.dev/google/zh_segmentation/1\"\n",
        "segmenter = tf_text.HubModuleTokenizer(MODEL_HANDLE)\n",
        "tokens = segmenter.tokenize([\"新华社北京\"])\n",
        "print(tokens.to_list())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cRXOToXTVCep"
      },
      "source": [
        "It may be difficult to view the results of the UTF-8 encoded byte strings. Decode the list values to make viewing easier."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "XeJHbr8XVctR"
      },
      "outputs": [],
      "source": [
        "def decode_list(x):\n",
        "  if type(x) is list:\n",
        "    return list(map(decode_list, x))\n",
        "  return x.decode(\"UTF-8\")\n",
        "\n",
        "def decode_utf8_tensor(x):\n",
        "  return list(map(decode_list, x.to_list()))\n",
        "\n",
        "print(decode_utf8_tensor(tokens))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eCnKgtjYRhOK"
      },
      "source": [
        "#### SplitMergeTokenizer\n",
        "\n",
        "The `SplitMergeTokenizer` \u0026 `SplitMergeFromLogitsTokenizer` have a targeted purpose of splitting a string based on provided values that indicate where the string should be split. This is useful when building your own segmentation models like the previous Segmentation example.\n",
        "\n",
        "For the `SplitMergeTokenizer`, a value of 0 is used to indicate the start of a new string, and the value of 1 indicates the character is part of the current string."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "3c-2iBiuWgjP"
      },
      "outputs": [],
      "source": [
        "strings = [\"新华社北京\"]\n",
        "labels = [[0, 1, 1, 0, 1]]\n",
        "tokenizer = tf_text.SplitMergeTokenizer()\n",
        "tokens = tokenizer.tokenize(strings, labels)\n",
        "print(decode_utf8_tensor(tokens))"
      ]
    },
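    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The label convention can be sketched in a few lines of plain Python. This is only an illustration of the 0/1 semantics described above, not the TF Text implementation, and the `split_merge_sketch` helper is a hypothetical name."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Hypothetical helper: 0 starts a new token, 1 extends the current one.\n",
        "def split_merge_sketch(text, char_labels):\n",
        "  pieces = []\n",
        "  for ch, label in zip(text, char_labels):\n",
        "    if label == 0 or not pieces:\n",
        "      pieces.append(ch)\n",
        "    else:\n",
        "      pieces[-1] += ch\n",
        "  return pieces\n",
        "\n",
        "print(split_merge_sketch(\"新华社北京\", [0, 1, 1, 0, 1]))"
      ]
    },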
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "l5F0zPFDwmcb"
      },
      "source": [
        "The `SplitMergeFromLogitsTokenizer` is similar, but it instead accepts logit value pairs from a neural network that predict if each character should be split into a new string or merged into the current one."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "JRWtRYMxw3oc"
      },
      "outputs": [],
      "source": [
        "strings = [[\"新华社北京\"]]\n",
        "labels = [[[5.0, -3.2], [0.2, 12.0], [0.0, 11.0], [2.2, -1.0], [-3.0, 3.0]]]\n",
        "tokenizer = tf_text.SplitMergeFromLogitsTokenizer()\n",
        "tokenizer.tokenize(strings, labels)\n",
        "print(decode_utf8_tensor(tokens))"
      ]
    },
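    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "One way to picture the logit pairs is to reduce each pair to a split or merge decision by taking the larger value. The plain-Python sketch below does exactly that. It assumes that the first value of each pair is the split logit and the second is the merge logit, and it ignores any additional options the tokenizer may apply; the helper names are hypothetical."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Assumption: each pair is (split_logit, merge_logit).\n",
        "def decisions_from_logits(logit_pairs):\n",
        "  # 1 means merge into the current token, 0 means start a new token.\n",
        "  return [1 if merge \u003e split else 0 for split, merge in logit_pairs]\n",
        "\n",
        "def merge_characters(text, decisions):\n",
        "  pieces = []\n",
        "  for ch, decision in zip(text, decisions):\n",
        "    if decision == 0 or not pieces:\n",
        "      pieces.append(ch)\n",
        "    else:\n",
        "      pieces[-1] += ch\n",
        "  return pieces\n",
        "\n",
        "logit_pairs = [[5.0, -3.2], [0.2, 12.0], [0.0, 11.0], [2.2, -1.0], [-3.0, 3.0]]\n",
        "decisions = decisions_from_logits(logit_pairs)\n",
        "print(decisions)\n",
        "print(merge_characters(\"新华社北京\", decisions))"
      ]
    },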
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "mWrGTOzbVb8U"
      },
      "source": [
        "#### RegexSplitter\n",
        "\n",
        "The `RegexSplitter` is able to segment strings at arbitrary breakpoints defined by a provided regular expression."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Szw0QZ6ecExC"
      },
      "outputs": [],
      "source": [
        "splitter = tf_text.RegexSplitter(\"\\s\")\n",
        "tokens = splitter.split([\"What you know you can't explain, but you feel it.\"], )\n",
        "print(tokens.to_list())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "uPIMvyot7GFv"
      },
      "source": [
        "## Offsets\n",
        "\n",
        "When tokenizing strings, it is often desired to know where in the original string the token originated from. For this reason, each tokenizer which implements `TokenizerWithOffsets` has a *tokenize_with_offsets* method that will return the byte offsets along with the tokens. The start_offsets lists the bytes in the original string each token starts at, and the end_offsets lists the bytes immediately after the point where each token ends. To refrase, the start offsets are inclusive and the end offsets are exclusive."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "UmZ91zl87J7y"
      },
      "outputs": [],
      "source": [
        "tokenizer = tf_text.UnicodeScriptTokenizer()\n",
        "(tokens, start_offsets, end_offsets) = tokenizer.tokenize_with_offsets(['Everything not saved will be lost.'])\n",
        "print(tokens.to_list())\n",
        "print(start_offsets.to_list())\n",
        "print(end_offsets.to_list())"
      ]
    },
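    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Because the offsets count bytes rather than characters, they differ from character positions as soon as the text contains multi-byte UTF-8 characters. The following plain-Python sketch illustrates the inclusive start and exclusive end convention on encoded bytes. It is not the TF Text implementation; the `whitespace_spans` helper and the sample text are assumptions made for illustration."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Hypothetical helper: byte spans of whitespace-separated tokens.\n",
        "def whitespace_spans(text):\n",
        "  data = text.encode(\"utf-8\")\n",
        "  spans, start = [], None\n",
        "  for i, byte in enumerate(data):\n",
        "    if byte in b\" \\t\\n\":\n",
        "      if start is not None:\n",
        "        spans.append((start, i))  # end offset is exclusive\n",
        "        start = None\n",
        "    elif start is None:\n",
        "      start = i  # start offset is inclusive\n",
        "  if start is not None:\n",
        "    spans.append((start, len(data)))\n",
        "  return data, spans\n",
        "\n",
        "data, spans = whitespace_spans(\"naïve café\")\n",
        "print(spans)\n",
        "# Slicing the bytes with [start:end] recovers each token.\n",
        "print([data[s:e].decode(\"utf-8\") for s, e in spans])"
      ]
    },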
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "mVGbkB-80819"
      },
      "source": [
        "## Detokenization\n",
        "\n",
        "Tokenizers which implement the `Detokenizer` provide a `detokenize` method which attempts to combine the strings. This has the chance of being lossy, so the detokenized string may not always match exactly the original, pre-tokenized string."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "iyThnPPQ0_6Q"
      },
      "outputs": [],
      "source": [
        "tokenizer = tf_text.UnicodeCharTokenizer()\n",
        "tokens = tokenizer.tokenize([\"What you know you can't explain, but you feel it.\"])\n",
        "print(tokens.to_list())\n",
        "strings = tokenizer.detokenize(tokens)\n",
        "print(strings.numpy())"
      ]
    },
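    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To see why a round trip can be lossy, consider a tokenizer that discards whitespace. The plain-Python sketch below stands in for such a tokenizer with `str.split` and rejoins the pieces with single spaces. It is an illustration only, not TF Text behavior, and the sample string is an assumption."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Plain-Python illustration of a lossy round trip (not TF Text).\n",
        "original = \"Never  tell me\\tthe odds.\"  # contains a double space and a tab\n",
        "pieces = original.split()\n",
        "restored = \" \".join(pieces)\n",
        "print(restored)\n",
        "# The exact whitespace was discarded, so the strings differ.\n",
        "print(restored == original)"
      ]
    },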
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "iVNFPYSZ7sf1"
      },
      "source": [
        "## TF Data\n",
        "\n",
        "TF Data is a powerful API for creating an input pipeline for training models. Tokenizers work as expected with the API."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "YSykDr1d7uxr"
      },
      "outputs": [],
      "source": [
        "docs = tf.data.Dataset.from_tensor_slices([['Never tell me the odds.'], [\"It's a trap!\"]])\n",
        "tokenizer = tf_text.WhitespaceTokenizer()\n",
        "tokenized_docs = docs.map(lambda x: tokenizer.tokenize(x))\n",
        "iterator = iter(tokenized_docs)\n",
        "print(next(iterator).to_list())\n",
        "print(next(iterator).to_list())"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "collapsed_sections": [],
      "name": "tokenizers.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
